Whole-Genome Comparison of Representatives of All Variants of SARS-CoV-2, Including Subvariant BA.2 and the GKA Clade

Since its discovery at the end of 2019, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has rapidly evolved into many variants, including the subvariant BA.2 and the GKA clade. Genomic clarification is needed for better management of the current pandemic and of the possible emergence of novel variants. The sequence of the reference genome Wuhan-Hu-1 and approximately 20 representatives of each variant were downloaded from GenBank and GISAID. Two representatives of each variant with no tracts of undetermined nucleotides were selected. The sequences were aligned using MUSCLE. The locations of insertions/deletions (indels) in the genome were mapped following the open reading frames (ORFs) of Wuhan-Hu-1. The phylogeny of the spike protein coding region was constructed using the maximum likelihood method. Amino acid substitutions in all ORFs were analyzed separately. There are two indel sites in ORF1AB, eight in spike, and one each in ORF3A, matrix (MA), nucleoprotein (NP), and the 3′-untranslated region (3′UTR). Some indel sites and substitutions are shared by more than one variant, and some are variant-specific. The phylogeny shows that Omicron, Deltacron, and BA.2 cluster together and are separated from other variants with 100% bootstrap support. In conclusion, whole-genome comparison of representatives of all variants revealed indel patterns that are specific to SARS-CoV-2 variants or subvariants. Polymorphic amino acid comparison across all coding regions also showed amino acid residues shared by specific groups of variants. Finally, the higher transmissibility of BA.2 might be due at least in part to the deletion of 48 nucleotides in the 3′UTR, while the apparent extinction of the GKA clade is due to the lack of genetic advantage as a consequence of amino acid substitutions in various genes.


Introduction
The rapid evolution of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of the coronavirus disease 2019 (COVID-19) pandemic, requires immediate scientific clarification to better manage the current pandemic and to serve as a reference for the possible emergence of novel variants. Since it was discovered at the end of 2019 [1], the original virus has evolved into many variants, which could be attributed to specific clinical consequences [2]. Established variants of concern (VOCs) are Alpha, Beta, Gamma, Delta, and Omicron; Lambda and Mu are variants of interest (VOIs), and GH/490R is a variant under monitoring (VUM). Some other clusters of viruses that are closely related to Omicron, the so-called BA.2 and GKA clades, which carry molecular markers of Delta and Omicron variants, require special attention. Originally, detection of the GKA clade, popularly known as Deltacron, led to arguments that it may be the result of a sequencing error [3]. However, there is an increasing number of SARS-CoV-2 whole-genome sequences labelled as clade GKA in the database with the notification "this submission requires investigation! It appears to contain markers of multiple lineages from both Delta and Omicron variants".
The large number of submitted whole-genome sequences poses a major computational challenge, and some submitted sequences contain long tracts of undetermined nucleotides. Here, we identify insertion/deletion (indel) and amino acid substitution patterns in the whole genome of representative variants, including the BA.2 subvariant and the GKA clade.

Materials and Methods
The sequence of the reference genome of SARS-CoV-2 strain Wuhan-Hu-1 (accession number NC_045512) was downloaded from GenBank. Ten to 20 complete sequences of each definitive variant, as well as of the subvariant BA.2 and the GKA clade, were selected randomly from GISAID and downloaded. Two representatives of each variant with no undetermined nucleotides ("N" or other IUPAC ambiguity codes) were selected; sequences with a single N or other undetermined nucleotide were also accepted. The dataset identifier is EPI_SET ID: EPI_SET_230223kx; doi: 10.55876/gis8.230223kx. The sequences were aligned using MUSCLE in MEGA-X software [9]. The locations of deletions/insertions in the genome of SARS-CoV-2 were mapped following the open reading frames of Wuhan-Hu-1, as available in the GenBank file. ORFs were analyzed separately to determine the effects of mutations and deletions/insertions.
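The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used (the selection in this study was done on GISAID downloads): the function names `parse_fasta`, `count_ambiguous`, and `select_definitive` are hypothetical, and the threshold of at most one undetermined nucleotide follows the acceptance rule stated in the text.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into {name: sequence}, keeping input order."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # first token of the header line
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def count_ambiguous(seq):
    """Count characters that are not A, C, G, T, or U (N and other IUPAC codes)."""
    return sum(1 for base in seq.upper() if base not in "ACGTU")


def select_definitive(records, max_ambiguous=1, n_keep=2):
    """Return the first n_keep names with at most max_ambiguous undetermined bases.

    Assumption: max_ambiguous=1 and n_keep=2 mirror the rule described in the text.
    """
    kept = [name for name, seq in records.items()
            if count_ambiguous(seq) <= max_ambiguous]
    return kept[:n_keep]


if __name__ == "__main__":
    toy = ">a\nACGT\nNNAC\n>b\nACGTN\n>c\nACGTACGT\n>d\nAAAA\n"
    print(select_definitive(parse_fasta(toy)))   # ['b', 'c']
```

Sequence "a" is rejected because it carries two undetermined positions, while "b" (one N) and "c" (none) are kept as the two representatives.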
Using the corresponding ORF of the coding region of Wuhan-Hu-1, the first 15 nucleotides of the 5′-terminus were searched, and the sequence prior to the marked sequence was deleted. The last 15 nucleotides of the Wuhan-Hu-1 ORF were used in the same way to mark the 3′-terminus. The selected sequences were translated into amino acid sequences and aligned using MEGA-X software [9]. Using the same software, the data were exported in MEGA format and analyzed further for polymorphic amino acids. We identified amino acids that were consistently substituted from Wuhan-Hu-1 across all variants and amino acids shared by the following groups: Omicron, BA.2, and GKA; Omicron and BA.2; Delta and GKA; Delta, Omicron, BA.2, and GKA; and Omicron and GKA. The final FASTA file of the dataset is available in Supplementary Material 1.
The phylogeny of the spike protein coding region of the representatives of variants was constructed using the maximum likelihood method and the JTT matrix-based model [10] in MEGA-X software [9]. The phylogenetic tree was rooted to the Wuhan-Hu-1 sequence.
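The ORF extraction and translation steps described above can be illustrated with a minimal Python sketch. It assumes that the 15-nucleotide terminal anchors of the reference ORF are conserved in the variant genome and that the two translated proteins have equal length (no indels), which is a simplification of the alignment performed in MEGA-X. The function names and the toy sequences are hypothetical.

```python
# Standard genetic code, built in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def trim_orf(genome, ref_orf, anchor=15):
    """Cut the ORF out of a genome using the first and last `anchor` nucleotides
    of the reference ORF. Returns None if either anchor is not found."""
    head, tail = ref_orf[:anchor], ref_orf[-anchor:]
    start = genome.find(head)
    if start < 0:
        return None
    end = genome.find(tail, start + anchor)
    if end < 0:
        return None
    return genome[start:end + anchor]


def translate(orf):
    """Translate codon by codon, stopping at the first stop codon.
    Codons with undetermined bases are written as X."""
    protein = []
    for i in range(0, len(orf) - 2, 3):
        aa = CODON_TABLE.get(orf[i:i + 3].upper(), "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def polymorphic_sites(ref_protein, var_protein):
    """List substitutions such as 'K7E' between two equal-length proteins."""
    return [f"{r}{pos}{v}"
            for pos, (r, v) in enumerate(zip(ref_protein, var_protein), start=1)
            if r != v]


if __name__ == "__main__":
    # Toy reference ORF and toy variant genome (not real SARS-CoV-2 sequence).
    ref_orf = "ATGGCTGATCGTACG" + "TTGAAA" + "CCCGGGTTTAAATAG"
    genome = "GGGTTT" + "ATGGCTGATCGTACG" + "TTGGAA" + "CCCGGGTTTAAATAG" + "ACGT"
    var_orf = trim_orf(genome, ref_orf)
    print(polymorphic_sites(translate(ref_orf), translate(var_orf)))   # ['K7E']
```

A real pipeline would need a fallback for genomes in which an anchor itself is mutated, since an exact string search then fails.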

Results
The indel patterns and their locations in the whole genome of representatives of the various SARS-CoV-2 variants are presented in Table 1. There are two indel sites in ORF1AB, eight in spike, and one each in ORF3A, MA, NP, and the 3′UTR. No indel occurs in IGS. Some indel sites are not unique, as they occur in more than one variant. D21605-21613 is unique to the Omicron and BA.2 variants, and D21965-21967 is unique to the Alpha variant; I21968-21971 and D26143-26146 are unique to the Mu variant; D28351-28359 is unique to the Omicron and BA.2 variants; and D29723-29748 is unique to the BA.2 variant. Some indels occur in only one representative of a variant.
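As an illustration of how indel labels of this form can be mapped onto the reference ORFs, the following Python sketch parses a label, finds the region containing its start position, and checks whether the indel length is a multiple of three. Two assumptions are made: the labels are taken to be in Wuhan-Hu-1 coordinates, and the region boundaries are those of the NC_045512.2 annotation as recalled here, so they should be verified against the GenBank file before reuse. The function names are hypothetical.

```python
import re

# (name, start, end), 1-based inclusive; assumed from the NC_045512.2 annotation.
REGIONS = [
    ("5'UTR", 1, 265), ("ORF1AB", 266, 21555), ("spike", 21563, 25384),
    ("ORF3A", 25393, 26220), ("E", 26245, 26472), ("MA", 26523, 27191),
    ("ORF6", 27202, 27387), ("ORF7A", 27394, 27759), ("ORF7B", 27756, 27887),
    ("ORF8", 27894, 28259), ("NP", 28274, 29533), ("ORF10", 29558, 29674),
    ("3'UTR", 29675, 29903),
]
NONCODING = {"5'UTR", "3'UTR", "IGS"}


def locate(position):
    """Return the first region containing the position, or 'IGS' if none does."""
    for name, start, end in REGIONS:
        if start <= position <= end:
            return name
    return "IGS"


def describe_indel(label):
    """Parse a label such as 'D21605-21613' into (kind, region, length, in_frame).

    in_frame is None for non-coding regions, where a reading frame does not apply.
    """
    match = re.fullmatch(r"([DI])(\d+)-(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized indel label: {label!r}")
    kind = "deletion" if match.group(1) == "D" else "insertion"
    start, end = int(match.group(2)), int(match.group(3))
    length = end - start + 1
    region = locate(start)
    in_frame = None if region in NONCODING else (length % 3 == 0)
    return kind, region, length, in_frame


if __name__ == "__main__":
    for label in ("D21605-21613", "D26143-26146", "D28351-28359"):
        print(label, describe_indel(label))
```

For these three labels the sketch reports a 9-nucleotide in-frame deletion in spike, a 4-nucleotide out-of-frame deletion in ORF3A, and a 9-nucleotide in-frame deletion in NP. Only the ORF3A deletion is out of frame under this check.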
All polymorphic amino acids of all proteins of two representatives of each variant of SARS-CoV-2 are listed in Supplementary Material 2. A summary of unique amino acids across entire genes in at least one of the representative strains of SARS-CoV-2 variants is presented in Table 2. Amino acids consistently substituted from Wuhan-Hu-1 across all variants are ORF1AB P4715L/F and spike D618G. Among the amino acids shared by Omicron, BA.2, and GKA, there are 10 in ORF1AB, 21 in spike, and one each in ORF3A, ORF6, and NP. Three deletions in NP are unique to Omicron and BA.2. Exclusive to Delta and GKA are two deletions in ORF1AB, four in spike, three in NP, and two each in ORF7A and ORF8; only one spike amino acid is shared by Delta, Omicron, BA.2, and GKA. Three insertions, Ins216E, Ins217P, and Ins218E, are unique to Omicron and GKA. GH/490 harbors 16 variant-specific substitutions relative to Wuhan-Hu-1, namely, five in ORF1AB, eight in spike, two in NP, and one in ORF3A. Unique amino acid substitutions from Wuhan-Hu-1 in the GKA clade comprise nine in ORF1AB, namely, E352D, A1306S, P2046L, A2529V, I2820V, V2930L, T3646A, P4715F, and A6319V; one in NP, namely, G215C; two in ORF3A, namely, S92L and D155Y; and one in ORF7B, namely, T40I.
The topology of the phylogenetic analysis of the spike protein gene of two representatives of each variant of SARS-CoV-2 is presented in Figure 1. The phylogeny shows that Omicron, Deltacron, and BA.2 cluster together and are separated from other variants with 100% bootstrap support.
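The grouping of substitutions into those found in all variants and those shared exclusively by a given group of variants amounts to simple set operations. The Python sketch below shows the idea on a toy dataset; the function names are hypothetical, and the entries prefixed with "toy:" are placeholders rather than observed substitutions.

```python
def shared_by_all(profiles):
    """Substitutions present in every variant profile."""
    return set.intersection(*profiles.values())


def shared_exclusively(profiles, group):
    """Substitutions present in every member of `group` and in no other variant."""
    inside = set.intersection(*(profiles[name] for name in group))
    outside = set()
    for name, substitutions in profiles.items():
        if name not in group:
            outside |= substitutions
    return inside - outside


if __name__ == "__main__":
    # Toy profiles: 'toy:' entries are invented placeholders for illustration.
    profiles = {
        "Delta":   {"spike:D618G", "toy:A1B"},
        "Omicron": {"spike:D618G", "spike:Ins216E", "toy:C2D"},
        "BA.2":    {"spike:D618G", "toy:C2D"},
        "GKA":     {"spike:D618G", "spike:Ins216E", "toy:A1B", "toy:C2D"},
    }
    print(shared_by_all(profiles))                            # {'spike:D618G'}
    print(shared_exclusively(profiles, {"Omicron", "GKA"}))   # {'spike:Ins216E'}
    print(shared_exclusively(profiles, {"Delta", "GKA"}))     # {'toy:A1B'}
```

With two representatives per variant, a real analysis would also have to decide whether a substitution counts for a variant when it appears in only one of the two representatives.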

Discussion
The genetic diversity of coronaviruses arises through mutation and recombination, as has also been described for SARS-CoV-2 [11]. Although the RNA-dependent RNA polymerases of coronaviruses possess proofreading capacity [12], the virus still undergoes mutation, which might lead to amino acid replacement. Such changes impact the biology of the virus as well as the clinical manifestation of its infection. Recombination involves viral RNA merging with other RNAs, either its own RNA, the RNA of other viruses, or cellular RNA; thus, template switching occurs during transcription [13]. This process leads to RNA indels. Mutations in SARS-CoV-2 prior to the emergence of variants have been reported [6]. In HIV, deletions occur by at least three different mechanisms: (i) misalignment of the growing point; (ii) incorrect synthesis and termination in the primer-binding sequence during the synthesis of the plus-strand strong-stop DNA; and (iii) incorrect synthesis and termination before the primer-binding sequence during synthesis of the plus-strand strong-stop DNA [14]. Previous whole-genome comparisons have been conducted, including for Omicron [15]. However, that work focused on phylogeny and did not cover the recently identified BA.2 and GKA lineages, the latter colloquially known as Deltacron. Indels and amino acid substitutions unique to specific variants were not described.
Table note: virus strains and GISAID codes; WH-1 is Wuhan-Hu-1.
Table note: numbers in parentheses are the numbers of unique amino acids in the corresponding gene; numbering is based on the ORFs of Wuhan-Hu-1, except for spike. * Spike numbering from 145 to 215 is Wuhan-Hu-1 numbering plus 1 due to the insertion in the Mu variant, while spike numbering from 219 onward is Wuhan-Hu-1 numbering plus 4 due to the insertion of the EPE residues in Omicron.
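Because the spike positions in the tables are offset from Wuhan-Hu-1 numbering (plus 1 for positions 145 to 215 and plus 4 from position 219 onward, as stated in the table note), converting back to reference numbering may help when comparing with other reports. The Python sketch below follows a literal reading of that note. The function name is hypothetical, the treatment of positions 216 to 218 as inserted residues without a reference counterpart is an assumption based on Ins216E, Ins217P, and Ins218E, and the exact behaviour at the boundary position 145 is not specified by the note.

```python
def spike_to_reference(position):
    """Convert a spike position in this study's numbering to Wuhan-Hu-1 numbering.

    Literal reading of the table note (an assumption, see lead-in):
      below 145   -> unchanged
      145 to 215  -> minus 1 (insertion in the Mu variant)
      216 to 218  -> None (inserted EPE residues, no reference counterpart)
      219 onward  -> minus 4 (Mu insertion plus the EPE insertion)
    """
    if position < 145:
        return position
    if position <= 215:
        return position - 1
    if position <= 218:
        return None
    return position - 4


if __name__ == "__main__":
    print(spike_to_reference(618))   # 614
    print(spike_to_reference(217))   # None
```

Under this convention the spike substitution reported here as D618G corresponds to position 614 in Wuhan-Hu-1 numbering, that is, the widely reported D614G.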
Through random selection of variant representatives with definitive sequences across the genome, we managed to identify unique patterns of indels and amino acid substitutions. Even with only two representatives for each variant, which is indeed a limitation of this study, we identified quasispecies or, in the case of a variant, a quasivariant. Viral quasispecies refers to a population structure that consists of extremely large numbers of variant genomes, termed mutant spectra, mutant swarms, or mutant clouds [16]. For SARS-CoV-2, this phenomenon has been discovered even in single infected individuals [17][18][19][20][21]. We propose the term quasivariant because many indels and amino acid substitutions occur in only one of the two representatives. We believe that more variation would be found if more variant representatives were analyzed.
The variant that harbors the most variant-specific substitutions from Wuhan-Hu-1 is the VUM GH/490. Both representatives show five, eight, two, and one amino acid substitutions in ORF1AB, spike, NP, and ORF3A, respectively. This VUM is being tracked in Europe, Africa, Asia, and America; however, its genome frequency in GISAID, as accessed on March 30, 2022, was lower than 0.3%.
The GKA clade does not comprise a unique variant. It harbors no unique indels or substitutions compared with Wuhan-Hu-1, but it does share 34 amino acid replacements from Wuhan-Hu-1 with Omicron and BA.2, 13 with Delta, and one in spike with Delta, Omicron, and BA.2. Three spike insertions (Ins216E, Ins217P, and Ins218E) found in one representative of the GKA clade are shared with Omicron. The molecular signatures of Delta and Omicron are obvious in the GKA clade. It is plausible that the GKA clade is an Omicron subvariant. We suggest that the clade is not the result of sequencing errors, as previously thought [3].
The GKA clade seems to have no genetic advantage, and it appears to have become extinct shortly after its discovery. Only 89 full genome sequences were tagged with the GKA clade when the GISAID database was accessed on May 3, 2022. The collection date of the earliest sequence was January 20, 2022, and that of the last one was March 21, 2022. This clade possesses nine amino acid changes from the reference strain Wuhan-Hu-1 in ORF1AB, two in ORF3A, and one each in NP and ORF7B. No unique amino acid change was observed in the spike protein. The clade seems to be suppressed by antibodies to other variants following previous natural infection and/or vaccination.
Interestingly, we identified a truncated ORF3A in the Mu variant. Deletion of four nucleotides generates a stop codon; thus, ORF3A in this variant is 257 amino acids in length, whereas the others are 275 residues long. This accessory protein contributes to the pathogenesis of SARS-CoV-2 by inducing pathological apoptosis [32]. The effect of the Mu variant at the cellular level has not yet been described. One article on this variant covered the neutralization effect of antibodies [33]. According to the GISAID database accessed on March 30, 2022, this variant has been identified in many countries, with a maximum global genome frequency of less than 1%, which has declined recently.
Figure 1: The phylogeny of the spike protein coding region of the representatives of variants of SARS-CoV-2. The phylogeny was constructed using the maximum likelihood method and the JTT matrix-based model [10] in MEGA-X software [9]. The phylogenetic tree was rooted to the Wuhan-Hu-1 sequence. The tree with the highest log likelihood is shown. The percentage of trees in which the associated taxa clustered together is shown next to the branches.
BA.2 differs from Omicron in the deletion of 48 nucleotides from the 3′UTR. The 3′UTR of coronaviruses contains all cis-acting sequences necessary for viral replication and binds to cellular components as well as the viral nsp1 and N proteins [34], which are required for minus-strand RNA synthesis [35]. This has also been described in SARS-CoV-2, whereby the 3′UTR is involved in genomic dimerization and interacts with cellular microRNA [36]. BA.2 has recently increased in frequency in multiple regions of the world, suggesting that it has a selective advantage over Omicron [37][38][39][40]. The genome frequency of BA.2 has increased exponentially to 90% of total Omicron submissions, based on GISAID as accessed on the date given above. The original SARS-CoV-2 has a basic reproduction number (R0) of 2.4-3 [41], Delta has an R0 of 5 [42], and Omicron has an R0 estimated to be higher than 10, or three times greater than that of Delta [43]; accordingly, the BA.2 subvariant might have an R0 of 15 or higher. The higher transmissibility of BA.2 might be attributed, at least in part, to the shorter 3′UTR, which may result in faster viral replication; this needs to be investigated further. However, because the coding region across the whole genome of BA.2, particularly that of the spike protein, is very close to that of Omicron, people who survived Omicron infection should be naturally protected against BA.2.
Phylogenetic analysis (Figure 1) demonstrated that the BA.2 and GKA subvariants belong to the Omicron variant. The phylogeny shows that Omicron, GKA/Deltacron, and BA.2 cluster together and are separated from other variants with 100% bootstrap support.

Conclusion
Whole-genome comparison of representatives of all variants revealed indel patterns that are specific to SARS-CoV-2 variants or subvariants. Polymorphic amino acid comparison across all coding regions also showed amino acid residues shared by specific groups of variants. Finally, the higher transmissibility of BA.2 might be due at least in part to the deletion of 48 nucleotides in the 3′UTR, which may result in faster viral replication, while the apparent extinction of the GKA clade is due to the lack of genetic advantage as a consequence of amino acid substitutions in various genes.

Data Availability
All genome sequences and associated metadata in this dataset are published in GISAID's EpiCoV database. The final dataset is available at GISAID with the identifier EPI_SET_230223kx, doi: 10.55876/gis8.230223kx. To view the contributors and each individual sequence with details such as the accession number, virus name, collection date, originating lab and submitting lab, and the list of authors, visit 10.55876/gis8.230223kx. All polymorphic amino acids of all proteins of two representatives of each variant of SARS-CoV-2 are listed in Supplementary Material 1.

Disclosure
An earlier version of the manuscript has been posted as a preprint at https://www.researchsquare.com/article/rs-1526043/v1.

Conflicts of Interest
The authors declare that they have no conflicts of interest.